feat(types): UUID Arrow type migration — pa.binary(16) for all UUID storage (PLT-1162) by kurodo3[bot] · Pull Request #161 · nauticalab/orcapod-python

kurodo3 · 2026-06-11T23:33:56Z

Summary

Replace pa.large_string() UUID storage with pa.binary(16) (fixed_size_binary[16]) across the entire OrcaPod codebase, introducing a semantic struct converter and updating all UUID generation and storage sites.

Changes

Area	What changed
`types.py`	No new constants added — each call site uses `pa.binary(16)` directly. Adding module-level lazy constants would trigger PyArrow import at `types` import time, defeating the `LazyModule` pattern already used there.
`semantic_struct_converters.py`	`UUIDStructConverter` with full `SemanticStructConverterProtocol` compliance
`versioned_hashers.py`	`UUIDStructConverter` registered under `"uuid"` in the default semantic type registry
`builtin_handlers.py`	`UUIDHandler.handle()` returns `obj.bytes` (16 raw bytes) instead of `str(obj)`
`arrow_utils.py`	System record_id tag column type → `pa.binary(16)`; cascading fixes to `universal_converter.py`, `join.py`, `merge_join.py`, `stream_builder.py`, `cached_source.py`
`datagram.py`, `data_function.py`	`datagram_id` now returns `uuid.UUID` (normalized from `uuid_utils.uuid7`); source-info UUID segment uses `.hex`
`logging_observer.py`, `status_observer.py`	`uuid7().bytes` (raw 16 bytes) instead of `str(uuid7())` for `_log_id` / `_status_id` Arrow columns
`postgresql_connector.py`	`uuid` PG type → `pa.binary(16)`; `_coerce_pg_value()` converts driver `uuid.UUID` objects to `.bytes`; TODO comment removed
`databases/utils.py`	`coerce_record_id()` helper: transparently encodes `str` record IDs to `bytes` at DB method boundaries
`superpowers/specs/`, `superpowers/plans/`	Design spec and implementation plan

Test plan

3616 tests pass, 17 skipped (live DB integration), 0 failures
Full UUIDStructConverter suite (19 tests) including Arrow array round-trip, bytearray handling, protocol method coverage
UUIDHandler returns bytes test
PostgreSQL type mapping unit test (pa.binary(16) for uuid)
Semantic registry registration verified end-to-end

Notes

Breaking changes (pre-v0.1.0, intentional per spec):

datagram_id is now uuid.UUID (returned by Datagram.datagram_id, Tag.datagram_id, Data.datagram_id). Any code assuming it was str will break.
record_id (DB pipeline entry IDs), _log_id, _status_id are now bytes (16 bytes). Any code assuming these were str will break.

Polars write_json() limitation: pola-rs/polars#15410 — OrcaPod has zero write_json() calls, unaffected in practice.

SQLite: No change required — TEXT columns remain pa.large_string(). Arrow type round-trip fidelity for loosely-typed backends is tracked in PLT-1615.

run_id: Intentionally unchanged — it's a transient Python string passed to observer callbacks, never stored in Arrow.

Fixes PLT-1162

🤖 Generated with Claude Code

codecov · 2026-06-11T23:35:08Z

Codecov Report

❌ Patch coverage is 95.68345% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/orcapod/databases/delta_lake_databases.py	83.33%	2 Missing ⚠️
src/orcapod/databases/connector_arrow_database.py	88.88%	1 Missing ⚠️
src/orcapod/databases/in_memory_databases.py	91.66%	1 Missing ⚠️
src/orcapod/pipeline/status_observer.py	50.00%	1 Missing ⚠️
...capod/semantic_types/semantic_struct_converters.py	97.67%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Formalises pa.binary(16) as the canonical Arrow type for all UUID values in OrcaPod — both system identifiers (UUID_ARROW_TYPE) and user-facing semantic round-trips (UUID_STRUCT_ARROW_TYPE). Documents the two-layer design, no-truncation rule, known Polars write_json limitation, and full implementation scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…y via module __getattr__ (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…uct round-trips (PLT-1162) Implements UUIDStructConverter in semantic_struct_converters.py following the existing PythonPathStructConverter/UPathStructConverter pattern. Stores UUIDs as fixed 16-byte binary in a single-field Arrow struct. Duck-typed to also accept uuid_utils.UUID objects (uuid7 etc.) which do not inherit from uuid.UUID but expose .bytes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ter (PLT-1162) Adds `can_handle_python_type`, `can_handle_struct_type`, and `hash_struct_dict` to `UUIDStructConverter` so it fully satisfies `SemanticStructConverterProtocol`. Also adds 7 new tests covering the new methods.

…c registry (PLT-1162) Register UUIDStructConverter in get_versioned_semantic_arrow_hasher so that uuid.UUID values are handled by the default semantic type registry. Add three tests verifying the registration by python type, struct signature, and semantic type name.

…th binary Arrow storage (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tring (PLT-1162)

…LT-1162) - Change record_id system tag column type from pa.large_string() to UUID_ARROW_TYPE (pa.binary(16)) in arrow_utils.add_system_tag_columns - Import UUID_ARROW_TYPE from orcapod.types in arrow_utils.py - Update stream_builder.py to generate deterministic UUID v5 bytes for record_id values (derived from source_id + provenance token), replacing the previous string-based provenance tokens passed to the system tag column - Add fixed_size_binary support to UniversalTypeConverter.arrow_type_to_python_type so UUID_ARROW_TYPE columns map to Python bytes - Fix _predict_system_tag_schema in Join and MergeJoin operators to carry through the actual column type (bytes for record_id) rather than hardcoding str - Fix CachedSource to exclude random system-tag record_id columns from the row hash used for deduplication (record_ids are now deterministic UUIDs per source, so dedup is stable) - Update tests that passed string record_ids to pass UUID bytes instead Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…D binary migration (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- datagram.py: datagram_id property returns bytes (uuid7().bytes), update record_id parameter and internal method signatures from str|None to bytes|None - data_function.py: generate uuid7() once, use .bytes for record_id Arrow storage and .hex for source_info provenance strings (source_info stays large_string) - logging_observer.py: log_id uses uuid7().bytes with pa.binary(16) column type - status_observer.py: status_id uses uuid7().bytes with pa.binary(16) column type in both _ContextualizedStatusObserver and _ScopedStatusObserver - Cascade updates to maintain consistency: - protocols/core_protocols/datagrams.py: datagram_id protocol return type bytes - tag_data.py: record_id parameter annotations bytes|None - function_node.py: data_record_id bytes, DATA_RECORD_ID stored as pa.binary(16) - databases: _ensure_record_id_column, add_record accept str|bytes in all impls - database_protocols.py: add_record protocol updated to str|bytes - Fix two tests broken by the type change: source_info UUID format (dashes→hex) and data_record_id column type (str→bytes with len==16) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nary(16)) (PLT-1162) - Move uuid out of the grouped string branch in _pg_type_to_arrow so it returns UUID_ARROW_TYPE = pa.binary(16) instead of pa.large_string() - Add _coerce_pg_value helper to convert psycopg uuid.UUID objects to bytes before Arrow array construction in iter_batches - Update test_uuid in TestPgTypeToArrow to assert pa.binary(16) - Import _coerce_pg_value in test file for completeness

…T-1162)

eywalker

Reviewing through, I don't think that the use of UUID TYPE constant is actually helping with anything -- let's just make that fixed to be defined as binary type and drop use of constants as we as the getattr trick.

eywalker · 2026-06-11T23:52:09Z

+            c for c in live_table.column_names
+            if c.startswith(constants.SYSTEM_TAG_RECORD_ID_PREFIX)
+        ]
+        hash_table = (


was this some kind of a bug that was present before?

Yes, this was a latent pre-existing bug exposed by the migration.

Before this PR, was str(uuid7()) — a fresh random string on every run. Since the dedup hash included the record_id column and that value changed every invocation, the cached-source dedup never actually recognised repeat rows as duplicates. It was silently broken.

The migration made it visible: deterministic UUID v5 record_ids meant the hash would be stable across runs, which is correct — but we also don't need to include the record_id in the hash at all, since it's derived from the source_id + provenance token already captured in the hash. So the fix excludes record_id tag columns from the dedup hash (removing redundant information) while making the dedup work correctly for the first time.

eywalker · 2026-06-11T23:53:08Z


+# Namespace UUID for source record IDs — all source record IDs are derived
+# from this namespace and the row's source_id + provenance token.
+_SOURCE_RECORD_ID_NAMESPACE = uuid.UUID("7f8e9d0c-1b2a-3c4d-5e6f-7a8b9c0d1e2f")


how is this value chosen?

It was arbitrary — chosen once by the implementer with no principled derivation, which is why it looked opaque.

Fixed: replaced with a UUID v5 derived from uuid.NAMESPACE_URL + a stable OrcaPod-specific URL string, so the value is reproducible from its inputs rather than being a magic hex literal:

_SOURCE_RECORD_ID_NAMESPACE = uuid.uuid5( uuid.NAMESPACE_URL, "https://orcapod.io/namespaces/source-record-id", )

Added a doc comment noting this value must never change once deployed, since changing it would invalidate all existing stored record IDs.

eywalker · 2026-06-11T23:55:22Z


    @property
-    def datagram_id(self) -> str:
+    def datagram_id(self) -> bytes:


wouldn't this be cleaner to return an actual UUID object instead?

Agreed — done. datagram_id now returns uuid.UUID throughout the protocol, Datagram, and Tag/Data. Generation uses uuid7() normalised to stdlib uuid.UUID (for cross-library equality). Arrow array construction calls .bytes at the point of actual serialisation, keeping the Python interface ergonomic.

eywalker · 2026-06-11T23:57:23Z

+
+
+def __getattr__(name: str) -> object:
+    """Lazy resolution for module-level constants that depend on pyarrow.


which constant would have triggered the invocation of pyarrow and why?

Accessing either constant triggered it. The __getattr__ in types.py called import pyarrow as _pa to construct the value on first access. So any from orcapod.types import UUID_ARROW_TYPE or from orcapod import UUID_ARROW_TYPE call would load the full pyarrow C-extension immediately — which made the __init__.py eager import of the constant just as non-lazy. The __init__.py __getattr__ was needed to defer even that.

This whole mechanism is removed now per the overall review direction: the constants are gone, both __getattr__ functions are removed, and pa.binary(16) is used inline at each site.

eywalker · 2026-06-11T23:58:35Z

            assert len(values) == result_table.num_rows
            for val in values:
-                assert isinstance(val, str)
+                # source_id columns are large_string (str), record_id columns are


perform a more thorough test based on the target column

Done. The assertion now checks per column: record_id columns (matching SYSTEM_TAG_RECORD_ID_PREFIX) assert isinstance(val, bytes) and len(val) == 16; all other system tag columns assert isinstance(val, str). This catches any accidental type regression in either direction.

…atagram_id, per-column tests (PLT-1162) - Remove UUID_ARROW_TYPE and UUID_STRUCT_ARROW_TYPE constants and their lazy __getattr__ resolvers from orcapod.types and orcapod.__init__; use pa.binary(16) / pa.struct([...]) inline everywhere - Replace _SOURCE_RECORD_ID_NAMESPACE opaque hex literal with a principled uuid.uuid5 derivation with doc comment explaining it must never change - Change datagram_id property return type from bytes to uuid.UUID throughout (protocol, Datagram, Tag, Data); generation via uuid7() normalised to stdlib uuid.UUID to ensure cross-library equality; call sites that build pa.binary(16) arrays use .bytes - Strengthen system-tag column type assertion in test_operators to per-column checks: record_id columns asserted as bytes+len==16, source_id columns asserted as str Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-12T00:16:47Z

Review round 1 — addressed in commit `13a1d5a`

All comments addressed. Summary:

Drop UUID_ARROW_TYPE / UUID_STRUCT_ARROW_TYPE constants + __getattr__ trick (overall review + question on __init__.py)
Removed both named constants and both __getattr__ functions (types.py and __init__.py). Every site now uses pa.binary(16) or pa.struct([pa.field("uuid", pa.binary(16))]) inline. The __getattr__ mechanism was needed because importing either constant immediately triggered pyarrow loading — that whole layer is gone.

cached_source.py dedup hash — yes, pre-existing latent bug
The old code included the record_id column in the dedup hash. Since record_id was str(uuid7()) (fresh random on every run), the hash was always different, so cached-source dedup was silently never matching. The migration made this visible. The fix excludes record_id columns from the hash (they're derived from data already in the hash) and makes the dedup work correctly for the first time.

_SOURCE_RECORD_ID_NAMESPACE — now principled
Replaced the opaque hex literal with uuid.uuid5(uuid.NAMESPACE_URL, "https://orcapod.io/namespaces/source-record-id") plus a doc comment that this value must never change once deployed.

datagram_id returns uuid.UUID
Accepted. Protocol, Datagram, Tag, Data all return uuid.UUID. Arrow array construction calls .bytes at the serialisation point.

Per-column system-tag type assertions in test_operators.py
Replaced the weak isinstance(val, (str, bytes)) check with per-column assertions: record_id columns → bytes + len == 16; all others → str.

eywalker · 2026-06-12T00:19:11Z

        """Return (or lazily generate) the datagram's unique ID."""
        if self._datagram_id is None:
-            self._datagram_id = str(uuid7())
+            # uuid7() returns a uuid_utils.UUID; normalise to stdlib uuid.UUID


Is there a good reason to mix uuid_utils and stdlib for handling uuid? Does uuid_utils offer richer set of functions for uuid generation/handling?

Agreed — is only here for UUIDv7 generation (time-ordered, monotonic), which isn't in the stdlib before Python 3.12. However and are distinct types that don't compare equal even for identical bit patterns, so we normalise immediately to on creation. The comment has been expanded to make this explicit.

eywalker · 2026-06-12T00:21:02Z

                        f"{stream.pipeline_hash().to_hex(n_char)}:{idx}"
                    )
-                    system_tag_fields[new_name] = str
+                    system_tag_fields[new_name] = col_type


We should include tests to ensure that the removal force convresion into str is not creating a new error?

Done. Added in . It iterates over all system-tag columns in the predicted schema after a two-way join and asserts that record_id columns map to (from ) and source_id columns map to — verifying that is carried through unchanged.

eywalker · 2026-06-12T00:26:58Z

        )

-        # Compute per-row record hashes for dedup: hash(full row)
+        # Compute per-row record hashes for dedup: hash(full row excluding


I acutally don't understand the logic here. For system tag, the source_id is the unique specifier of the nature/identity of the source and the record_id is a unique specifier about the origin of the specific record. If we are deeming record_id to be truly redundant with record_hash, then we should use record_id as the record_hash. Shouldn't it be the case that we'd actually include everything that's part of the identity of the row as part of the record hash so that in case two entries with everything identical BUT the record_id, suggesting that they originated from two distinct entries in the source that happens to carry identical values, we can correctly distinguish them? Otherwise, of course the inclusion of the rest of the fields are important to detect situation where same record_id entries end up having distinct values but we still want to track it.

Agreed — I reverted the exclusion. The record_id system tag encodes row origin (source identity + provenance token), so two rows that are byte-identical in their data columns but originate from different positions in the source will have distinct record_id values and must be stored as separate entries. Hashing the full live_table (including the record_id column) is correct. The dedup logic now hashes all system tag columns.

eywalker · 2026-06-12T00:28:56Z

+# record IDs.
+_SOURCE_RECORD_ID_NAMESPACE = uuid.uuid5(
+    uuid.NAMESPACE_URL,
+    "https://orcapod.io/namespaces/source-record-id",


What is this typically set to? Also be sure to use Orcapod instead of OrcaPod throughout the codebase. If this is meant to be a namespace URL that we would maintain, make it https://orcapod.org/namespaces/source-record-id

Fixed:

URL corrected to https://orcapod.org/namespaces/source-record-id (was http://orcapod.ai/...)

Capitalisation corrected to "Orcapod"

Added the pre-computed value as a comment: uuid.UUID('877ec89b-6645-5852-ba37-a94604043f5e')

eywalker · 2026-06-12T00:29:32Z

    return f"row_{row_index}"


+def _make_record_id_bytes(source_id: str, provenance_token: str) -> bytes:


let's instead return uuid.UUID for consistency?

Fixed — _make_record_id_bytes renamed to _make_record_id and now returns a uuid.UUID rather than raw bytes. The call site does .bytes to get the 16-byte value for Arrow storage. This gives callers a real UUID object (with full UUID API surface) instead of an opaque byte string.

eywalker · 2026-06-12T00:33:01Z

        self._scoped_path = _scoped_path
        self._pending_batches: dict[str, pa.Table] = _shared_pending_batches if _shared_pending_batches is not None else {}
-        self._pending_record_ids: dict[str, set[str]] = _shared_pending_record_ids if _shared_pending_record_ids is not None else defaultdict(set)
+        self._pending_record_ids: dict[str, set[str | bytes]] = _shared_pending_record_ids if _shared_pending_record_ids is not None else defaultdict(set)


why do we still have to take in str?

Dropped str entirely. compute_pipeline_entry_id now returns bytes (a method-prefixed digest via ContentHash.to_prefixed_digest()), and UUID-based record IDs are stored as 16-byte raw values — both are bytes. There is no longer a caller path that passes a str record ID, so the str union arm was just dead code.

eywalker · 2026-06-12T00:33:46Z

+        self, arrow_data: pa.Table, record_id: str | bytes
    ) -> pa.Table:
        if self.RECORD_ID_COLUMN not in arrow_data.column_names:
+            arrow_type = pa.binary(16) if isinstance(record_id, bytes) else pa.large_string()


why is this necessary? What currently can yield incompatible str record_id?

Cleaned up — record_id is now bytes only and _ensure_record_id_column always creates a pa.large_binary() column (no conditional branch on type). pa.large_binary() is used uniformly for all database record ID columns: it accommodates both fixed-size UUID bytes (16 bytes) and variable-length method-prefixed pipeline entry digests without requiring a schema change if sizes shift.

eywalker · 2026-06-12T00:34:07Z

        if self.RECORD_ID_COLUMN not in arrow_data.column_names:
            # Add record_id column at the beginning
-            key_array = pa.array([record_id] * len(arrow_data), type=pa.large_string())
+            arrow_type = pa.binary(16) if isinstance(record_id, bytes) else pa.large_string()


we should drop support for large_string

Dropped large_string support entirely. _ensure_record_id_column now always creates a pa.large_binary() column unconditionally, and all str | bytes / str type annotations in the Delta Lake DB layer have been updated to bytes only — matching the protocol and the other database implementations.

…o_prefixed_digest() Addresses second round of PR review comments on the UUID Arrow type migration. Key changes: - Add `ContentHash.to_prefixed_digest()` → `b"{method}:{digest}"` for human-readable method-prefixed binary record IDs - `compute_pipeline_entry_id()` now returns `bytes` via `to_prefixed_digest()`; update all `dict[str, ...]` caches in FunctionNode to `dict[bytes, ...]` - Migrate all database record-ID storage from `pa.binary(16)` → `pa.large_binary()` so UUID-based and hash-based IDs share a single consistent Arrow type - Drop `str | bytes` unions from `ArrowDatabaseProtocol`, all DB implementations, and `_pending_record_ids`; record IDs are now uniformly `bytes` - `_make_record_id_bytes` → `_make_record_id` returning `uuid.UUID` (caller calls `.bytes`) - Namespace URL fixed: `orcapod.io` → `orcapod.org`, "Orcapod" capitalisation, computed UUID value added as comment - Revert C3: include all system tag columns (including record_id) in dedup hash so rows with identical data values but different provenance are not collapsed - Add test for `col_type` preservation through `_predict_system_tag_schema` - Improve datagram.py comment explaining uuid_utils vs stdlib normalisation - Update all DB and FunctionNode tests to use `bytes` record IDs Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

kurodo3 · 2026-06-12T01:41:29Z

Round 2 review — summary of changes

All review comments addressed in commit fix(uuid): migrate record IDs to pa.large_binary(); add ContentHash.to_prefixed_digest().

C1 — `uuid_utils` vs stdlib UUID (`datagram.py`)

Expanded the comment to explain both why uuid_utils is used (UUIDv7 generation, not yet in stdlib before 3.12) and why we normalise to uuid.UUID immediately (cross-library type inequality even for identical bit patterns).

C2 — `col_type` propagation through `_predict_system_tag_schema` (`join.py`)

Added test_output_schema_preserves_system_tag_col_types to TestJoinOutputSchemaSystemTags. The test verifies that record_id system-tag columns carry bytes and source_id columns carry str through output_schema().

C3 — `record_id` in dedup hash (`cached_source.py`)

Reverted the exclusion. record_id encodes row origin (source identity + provenance token), so two identical-data rows from different source positions must produce distinct hash entries. The dedup loop now hashes the full live_table including all system tag columns.

C4 — Namespace URL / capitalisation (`stream_builder.py`)

URL corrected to https://orcapod.org/namespaces/source-record-id
Capitalisation corrected to "Orcapod"
Pre-computed UUID value added as a comment: uuid.UUID('877ec89b-6645-5852-ba37-a94604043f5e')

C5 — `_make_record_id` return type (`stream_builder.py`)

Renamed _make_record_id_bytes → _make_record_id; function now returns uuid.UUID. Call site uses .bytes for Arrow storage.

C6 / C7 — Drop `str` from `connector_arrow_database.py`

record_id is now bytes only throughout. _ensure_record_id_column always creates pa.large_binary() unconditionally. No caller passes str record IDs — the union was dead code.

C8 — Drop `large_string` from `delta_lake_databases.py`

Same change applied to the Delta Lake DB layer: _ensure_record_id_column always uses pa.large_binary(), and all str | bytes / str type annotations updated to bytes.

Broader scope (discussed and agreed during review)

pa.large_binary() for all database record ID columns — replaces pa.binary(16) in arrow_utils.py. Accommodates both fixed-size UUID bytes and variable-length method-prefixed pipeline entry digests without a future schema migration.
ContentHash.to_prefixed_digest() — new method returning b"{method}:{digest}" bytes. Used by compute_pipeline_entry_id(), which now returns bytes instead of a hex string.
Protocol and all DB implementations — add_record, get_record_by_id, get_records_by_ids all updated to bytes-only signatures across database_protocols.py, noop_database.py, in_memory_databases.py, connector_arrow_database.py, delta_lake_databases.py.
All tests passing (3616 tests, 56 skipped).

…ID mapping - test-objective/unit/test_databases.py: update all string record IDs to bytes literals (b"...") to match the bytes-only record_id API introduced in the large_binary migration. String IDs were silently encoded by PyArrow but then not found during set-membership lookup (str != bytes). - postgresql_connector.py: map PostgreSQL uuid type to pa.large_string() instead of pa.binary(16), completing the TODO: PLT-1162 marker in the integration test. psycopg returns uuid.UUID objects; the value coercion now calls str(value) to produce the canonical "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" form. - tests/test_databases/test_postgresql_connector.py: update test_uuid assertion to expect pa.large_string(). - tests/test_databases/test_postgresql_connector_integration.py: remove the # TODO: PLT-1162 comment now that the assertion is fulfilled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a shared ``coerce_record_id(str | bytes) -> bytes`` helper in ``databases/_utils.py`` and apply it at the entry of every public method that accepts a ``record_id`` parameter: - ``add_record`` / ``get_record_by_id`` / ``get_records_by_ids`` in ``InMemoryArrowDatabase``, ``ConnectorArrowDatabase``, ``DeltaTableDatabase``, and ``NoOpArrowDatabase`` - ``_ensure_record_id_column`` and ``_create_record_id_filter`` in ``DeltaTableDatabase`` The protocol signatures are updated to ``str | bytes`` to reflect that both are accepted. Internally all paths remain ``bytes`` after coercion; the Arrow storage type stays ``pa.large_binary()``. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…l only The str → bytes coercion is a convenience feature of the concrete database implementations, not part of the ArrowDatabaseProtocol contract. The protocol signatures stay bytes-only so that callers coding to the protocol get a clear type contract; the leniency lives in the implementing classes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

No leading underscore; the file is not private to the package, just a shared utility module. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eywalker · 2026-06-12T01:44:03Z

 import logging
 import re
 import sys
+import uuid as _uuid_module


What's the need to name this _uuid_module?

Fixed — removed the alias. Changed to plain import uuid and updated the usage to uuid.UUID(bytes=uuid7().bytes).

Copilot

Pull request overview

Migrates OrcaPod’s UUID handling from string-based Arrow storage toward binary representations, updating semantic type conversion, hashing/caching IDs, and database record ID APIs to work with bytes and fixed-size binary types.

Changes:

Introduces a UUIDStructConverter and registers it in the default semantic registry; updates UUIDHandler to hash UUIDs via raw bytes.
Switches many internal identifiers (system tags, pipeline entry IDs, observers, DB record IDs) from str to bytes and updates tests accordingly.
Expands type conversion support for fixed_size_binary → bytes, and updates multiple DB implementations/connectors to store record IDs in binary columns.

Reviewed changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
tests/test_utils/test_arrow_utils.py	Updates system tag record_id test input to use UUID bytes.
tests/test_semantic_types/test_uuid_struct_converter.py	Adds full unit coverage for `UUIDStructConverter`.
tests/test_semantic_types/test_semantic_registry.py	Verifies UUID converter registration and reverse lookup.
tests/test_hashing/test_uuid_handler.py	Adds tests asserting UUID handler returns 16-byte form.
tests/test_databases/test_postgresql_connector.py	Updates tests around PG type mapping/coercion (UUID mapping still asserted as string).
tests/test_databases/test_postgresql_connector_integration.py	Updates integration expectations (UUID mapping still asserted as string).
tests/test_databases/test_noop_database.py	Updates record_id arguments to bytes.
tests/test_databases/test_in_memory_database.py	Updates record_id arguments/expectations to bytes.
tests/test_databases/test_delta_table_database.py	Updates record_id arguments/expectations to bytes.
tests/test_databases/test_delta_table_database_s3.py	Updates record_id arguments/expectations to bytes.
tests/test_databases/test_connector_arrow_database.py	Updates record_id arguments/expectations to bytes.
tests/test_core/operators/test_operators.py	Adds assertions that system tag schemas preserve bytes vs str types.
tests/test_core/nodes/test_function_node_iteration.py	Updates cached entry-id key type to bytes.
tests/test_core/function_pod/test_function_pod_node.py	Updates data record-id assertions to 16-byte values.
tests/test_core/function_pod/test_function_node_caching.py	Updates pipeline entry-id type assertions to bytes.
tests/test_core/data_function/test_data_function.py	Adjusts source-info UUID segment expectations to hex form.
test-objective/unit/test_databases.py	Updates objective tests to use bytes record IDs and binary Arrow arrays.
test-objective/unit/test_arrow_data_utils.py	Updates system-tag tests to pass UUID bytes as record IDs.
superpowers/specs/2026-06-11-plt-1162-uuid-arrow-type-design.md	Adds design spec for UUID Arrow type migration.
superpowers/plans/2026-06-11-plt-1162-uuid-arrow-type.md	Adds implementation plan/checklist for migration steps.
src/orcapod/utils/arrow_utils.py	Changes system tag record_ids typing to bytes; updates record_id column storage type.
src/orcapod/types.py	Adds method-prefixed raw digest bytes API (`to_prefixed_digest`).
src/orcapod/semantic_types/universal_converter.py	Maps `fixed_size_binary` Arrow types to Python `bytes`.
src/orcapod/semantic_types/semantic_struct_converters.py	Adds `UUIDStructConverter` implementation.
src/orcapod/protocols/database_protocols.py	Updates Arrow database protocol record_id types to bytes.
src/orcapod/protocols/core_protocols/datagrams.py	Changes datagram_id protocol type to `uuid.UUID`.
src/orcapod/pipeline/status_observer.py	Stores `_status_id` as `binary(16)` UUID bytes.
src/orcapod/pipeline/logging_observer.py	Stores `_log_id` as `binary(16)` UUID bytes.
src/orcapod/hashing/versioned_hashers.py	Registers `UUIDStructConverter` in default semantic registry.
src/orcapod/hashing/semantic_hashing/builtin_handlers.py	Updates `UUIDHandler` to return UUID bytes.
src/orcapod/databases/utils.py	Adds helper to coerce `record_id` from str→bytes.
src/orcapod/databases/postgresql_connector.py	Adds value coercion helper; updates batch conversion pipeline; UUID mapping remains string.
src/orcapod/databases/noop_database.py	Broadens record_id parameter types to accept bytes.
src/orcapod/databases/in_memory_databases.py	Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/databases/delta_lake_databases.py	Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/databases/connector_arrow_database.py	Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/core/sources/stream_builder.py	Makes deterministic per-row UUID record IDs and stores bytes in system tags.
src/orcapod/core/sources/cached_source.py	Updates dedup hashing comment to reflect system-tag semantics.
src/orcapod/core/result_cache.py	Reads/writes DB record IDs as UUID bytes; converts back to `uuid.UUID` on lookup.
src/orcapod/core/operators/merge_join.py	Preserves original system-tag Python types (bytes vs str) in predicted schema.
src/orcapod/core/operators/join.py	Preserves original system-tag Python types; updates sort key defaults for bytes record IDs.
src/orcapod/core/nodes/function_node.py	Switches pipeline entry IDs and caches from str→bytes using `to_prefixed_digest()`.
src/orcapod/core/datagrams/tag_data.py	Changes datagram record_id parameter type to `uuid.UUID
src/orcapod/core/datagrams/datagram.py	Normalizes UUIDv7 IDs to stdlib `uuid.UUID` on generation.
src/orcapod/core/data_function.py	Stores record_id as `uuid.UUID` and source_info embeds UUID hex.
src/orcapod/init.py	Minor formatting-only change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

kurodo3 · 2026-06-12T05:33:40Z

    source_id_array = pa.array(source_ids, type=pa.large_string())
-    record_id_array = pa.array(record_ids, type=pa.large_string())
+    record_id_array = pa.array(record_ids, type=pa.large_binary())



Fixed — changed add_system_tag_columns() to use pa.binary(16) for the record_id column. Record IDs in system tags are always UUID v5 bytes (exactly 16 bytes), so pa.binary(16) is the precise type. The docstring and the implementation now agree.

Fixed in commit e0bbc37. add_system_tag_columns() now uses pa.binary(16) for the record_id column — both the field declaration and the pa.array() call — matching the docstring.

kurodo3 · 2026-06-12T05:33:45Z

    table = table.append_column(
-        pa.field(record_id_col_name, pa.large_string(), nullable=False), record_id_array
+        pa.field(record_id_col_name, pa.large_binary(), nullable=False), record_id_array
    )


Fixed — same change as above. add_system_tag_columns() now appends the record_id column as pa.binary(16) (both the field definition and the pa.array() call). The docstring was already correct; the implementation is now consistent with it.

Fixed in commit e0bbc37 (same change as above — both the field and the array call now use pa.binary(16)).

kurodo3 · 2026-06-12T05:33:50Z

+    if t == "uuid":
+        return _pa.large_string()
+


Fixed — reverted _pg_type_to_arrow("uuid") back to pa.binary(16), and updated _coerce_pg_value to extract .bytes from psycopg's uuid.UUID objects instead of calling str(). Unit test (test_uuid) and integration test (test_all_pg_types) updated accordingly.

Fixed in commit e0bbc37. _pg_type_to_arrow("uuid") now returns pa.binary(16).

kurodo3 · 2026-06-12T05:38:01Z

+    # psycopg returns uuid columns as uuid.UUID objects; large_string() needs str
+    if arrow_type == _pa.large_string() and hasattr(value, "hex") and not isinstance(value, str):
+        return str(value)
+    return value


Fixed in commit e0bbc37. _coerce_pg_value() now checks arrow_type == pa.binary(16) and returns value.bytes (raw 16 bytes) instead of str(value).

kurodo3 · 2026-06-12T05:38:05Z

    def test_uuid(self):
-        assert _pg_type_to_arrow("uuid", "uuid") == pa.large_string()
+        result = _pg_type_to_arrow("uuid", "uuid")
+        assert result == pa.large_string()



Fixed in commit e0bbc37. test_uuid now asserts pa.binary(16).

kurodo3 · 2026-06-12T05:38:09Z

        assert infos["g"].arrow_type == pa.large_string()
        assert infos["h"].arrow_type == pa.large_binary()
-        assert infos["i"].arrow_type == pa.large_string()  # TODO: PLT-1162
+        assert infos["i"].arrow_type == pa.large_string()
        assert infos["j"].arrow_type == pa.large_string()


Fixed in commit e0bbc37. The integration test now asserts infos["i"].arrow_type == pa.binary(16) for the UUID column.

kurodo3 · 2026-06-12T05:38:25Z

+    def __init__(self) -> None:
+        super().__init__("uuid")
+        self._python_type = _uuid_module.UUID
+        self._arrow_struct_type = pa.struct([pa.field("uuid", pa.binary(16))])
+


No code change needed — updated the PR description instead. Adding module-level constants to types.py would trigger PyArrow import at types import time, defeating the LazyModule pattern already used there. Each call site keeping its own pa.binary(16) literal is clear and avoids that footgun. The PR description table row for types.py now reflects this.

kurodo3 · 2026-06-12T05:38:33Z

    @property
-    def datagram_id(self) -> str:
+    def datagram_id(self) -> uuid.UUID:
        """Return (or lazily generate) the datagram's unique ID."""
        if self._datagram_id is None:
-            self._datagram_id = str(uuid7())
+            # uuid_utils is used here because it provides UUIDv7 (time-ordered,
+            # monotonic) generation, which is not available in the stdlib before
+            # Python 3.12.  However, uuid_utils.UUID and stdlib uuid.UUID are
+            # distinct types that do not compare equal even for identical bit
+            # patterns.  We therefore normalise to stdlib uuid.UUID immediately
+            # so that the public API always returns a consistent type and
+            # equality / hashing work correctly regardless of how a UUID was
+            # originally produced.
+            self._datagram_id = uuid.UUID(bytes=uuid7().bytes)
        return self._datagram_id


The code is correct — datagram_id intentionally returns uuid.UUID (not bytes), as agreed in a prior review round. The PR description's Breaking Change note was wrong. Updated it to correctly state: datagram_id → uuid.UUID; record_id/_log_id/_status_id → bytes (16 bytes).

…op uuid module alias - Remove `import uuid as _uuid_module` alias in data_function.py — use plain `import uuid` instead - Change system-tag record_id column from pa.large_binary() to pa.binary(16) in arrow_utils.py: record IDs are always UUID v5 bytes (exactly 16 bytes), so a fixed-size type is more precise - Revert PostgreSQL UUID → Arrow mapping from pa.large_string() back to pa.binary(16), and coerce uuid.UUID values via .bytes instead of str() - Update unit and integration test assertions to match pa.binary(16) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-12T05:34:00Z

Review round 3 — changes made

Three fixes in commit e0bbc37:

import uuid as _uuid_module alias removed (src/orcapod/core/data_function.py)
Plain import uuid now used throughout; call site updated to uuid.UUID(bytes=uuid7().bytes).

System-tag record_id column type changed to pa.binary(16) (src/orcapod/utils/arrow_utils.py)
add_system_tag_columns() now uses pa.binary(16) for both the field declaration and the pa.array() call. Record IDs in system tags are always UUID v5 bytes (exactly 16 bytes), so a fixed-size binary type is precise and prevents out-of-spec values.

PostgreSQL UUID mapping reverted to pa.binary(16) (src/orcapod/databases/postgresql_connector.py)
_pg_type_to_arrow("uuid") returns pa.binary(16) again. _coerce_pg_value extracts .bytes from psycopg's uuid.UUID objects (instead of calling str()). Unit test test_uuid and integration test test_all_pg_types updated to assert pa.binary(16).

All 3616 tests pass (excluding live-postgres integration tests, which require a running instance).

kurodo3 · 2026-06-12T05:38:44Z

Copilot review round — changes made

Commit e0bbc37 addressed comments 1–6 (all were the same class of issue: pa.large_binary() / pa.large_string() where pa.binary(16) should be used):

arrow_utils.py — add_system_tag_columns() record_id column changed from pa.large_binary() to pa.binary(16) (both the field declaration and the pa.array() call).
postgresql_connector.py — _pg_type_to_arrow("uuid") reverted to pa.binary(16); _coerce_pg_value() updated to extract .bytes from psycopg uuid.UUID objects rather than calling str().
tests/test_databases/test_postgresql_connector.py — test_uuid now asserts pa.binary(16).
tests/test_databases/test_postgresql_connector_integration.py — test_all_pg_types UUID column assertion updated to pa.binary(16).

No code change for comments 7–8 — PR description corrections only:

Constants (UUID_ARROW_TYPE / UUID_STRUCT_ARROW_TYPE): Not added to types.py. Adding module-level constants would eagerly trigger PyArrow import at types import time, defeating the LazyModule pattern already used there. Each call site uses pa.binary(16) directly. PR description table updated to reflect this.
datagram_id type: The code is correct — datagram_id returns uuid.UUID (decided in a prior review round). The Breaking Change note in the PR description was wrong; updated to distinguish datagram_id → uuid.UUID from record_id/_log_id/_status_id → bytes.

The property returns a uuid.UUID, so the name datagram_uuid is more descriptive. Renames the public property and protocol method, the private backing field (_datagram_uuid), and all call sites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 7 comments.

kurodo3 · 2026-06-12T17:06:23Z

+        The Schema stores Python types.  record_id system-tag columns are
+        stored as ``pa.large_binary()`` in Arrow, which maps to ``bytes`` in
+        the Python Schema.  source_id columns are ``pa.large_string()`` →
+        ``str``.  After a two-way join both Python types must survive — no


Fixed — docstring now says pa.binary(16) instead of pa.large_binary().

kurodo3 · 2026-06-12T17:06:33Z

+    Returns:
+        A deterministic ``uuid.UUID`` identifying this (source, row) pair.
+    """
+    name = f"{source_id}::{provenance_token}"


Fixed — replaced f"{source_id}::{provenance_token}" with the length-prefixed form f"{len(source_id)}:{source_id}:{provenance_token}". Since the length of source_id is embedded, the parser can always locate the boundary exactly, even when either component contains : or ::. This is a pre-v0.1.0 hash change (intentional; no backward-compat obligation).

kurodo3 · 2026-06-12T17:06:41Z

@@ -783,7 +786,7 @@ def get_records_by_ids(
        elif isinstance(record_ids, (pa.Array, pa.ChunkedArray)):
            record_ids_list = cast(list[str], record_ids.to_pylist())


Fixed — both the pl.Series and pa.Array/pa.ChunkedArray branches now apply coerce_record_id element-wise, matching the Collection branch. All three paths now produce a list[bytes].

kurodo3 · 2026-06-12T17:06:53Z

        # The record_id segment is between the URI components and the key name
-        # Format: uri_part1:uri_part2:..::record_id::key
-        uuid_pattern = re.compile(
-            r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
-        )
-        assert uuid_pattern.search(source_str), f"No UUID found in {source_str!r}"
+        # Format: uri_part1:uri_part2:..::record_id_hex::key
+        # record_id is stored as a 32-char hex string (UUID7 bytes, no dashes)
+        uuid_hex_pattern = re.compile(r"[0-9a-f]{32}")
+        assert uuid_hex_pattern.search(source_str), f"No UUID hex found in {source_str!r}"


Fixed — the test now splits source_str on "::" and validates parts[1] (the UUID hex segment) with re.fullmatch(r"[0-9a-f]{32}", ...) instead of searching the whole string. It also asserts that exactly 3 segments exist, making the format contract explicit.

kurodo3 · 2026-06-12T17:06:59Z

    PostgreSQLConnector,
    _arrow_type_to_pg_sql,
+    _coerce_pg_value,
    _pg_type_to_arrow,
 )


Added TestCoercePgValue with 5 focused tests: None passthrough, uuid.UUID → .bytes for pa.binary(16), uuid.UUID passed through unchanged for other Arrow types, raw bytes passthrough, and plain scalar passthrough. The import is now exercised by the test suite.

…D name, missing coercion, stale docs - stream_builder: fix UUIDv5 name ambiguity — replace f"{source_id}::{token}" with length-prefixed f"{len(source_id)}:{source_id}:{token}" so pairs whose components contain "::" can never collide (pre-v0.1.0 intentional hash change) - stream_builder: fix _make_record_id() docstring — pa.large_binary() → pa.binary(16) - delta_lake_databases: fix get_records_by_ids() to coerce pl.Series and pa.Array inputs through coerce_record_id, matching the Collection branch - delta_lake_databases: fix _create_record_ids_filter() type hint — list[str] → list[bytes] - test_operators: fix docstring — pa.large_binary() → pa.binary(16) - test_data_function: make UUID hex assertion precise — split on "::" and validate parts[1] exactly rather than searching anywhere in the string - test_postgresql_connector: add TestCoercePgValue with 5 focused tests for _coerce_pg_value (None passthrough, UUID→bytes for binary(16), UUID passthrough for other types, bytes passthrough, scalar passthrough) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-12T17:07:13Z

Copilot review round 2 — changes made (commit `d0cbe5a`)

UUIDv5 name ambiguity fixed (stream_builder.py)
_make_record_id() now uses the length-prefixed form f"{len(source_id)}:{source_id}:{provenance_token}" instead of f"{source_id}::{provenance_token}". A source_id whose length is embedded in the name gives an unambiguous parse boundary, so pairs that contain : or :: can no longer collide. Pre-v0.1.0 intentional hash change.

Missing coercion in get_records_by_ids() branches (delta_lake_databases.py)
The pl.Series and pa.Array/pa.ChunkedArray branches now apply coerce_record_id element-wise (same as the Collection branch), producing list[bytes] consistently across all input types.

_create_record_ids_filter() type hint (delta_lake_databases.py)
Parameter type updated from list[str] to list[bytes].

Stale docstrings (stream_builder.py, test_operators.py)
Both docstrings updated from pa.large_binary() to pa.binary(16).

More precise UUID hex assertion (test_data_function.py)
Test now splits on "::", validates exactly 3 segments, and uses re.fullmatch on parts[1] specifically — no longer susceptible to false positives from hex substrings in URI components.

_coerce_pg_value tests added (test_postgresql_connector.py)
New TestCoercePgValue class with 5 focused tests: None passthrough, uuid.UUID → .bytes for pa.binary(16), uuid.UUID passed through unchanged for other types, raw bytes passthrough, and plain scalar passthrough.

The length-prefix encoding introduced in d0cbe5a was unnecessary. In practice source_id is always a hex content hash (no "::" possible) and provenance_token is "row_N" or "col=value" — the "::" separator cannot appear in the hash path. A dedicated issue tracks formalising this invariant with validation (see PLT-XXXX). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eywalker · 2026-06-12T20:48:47Z


-        record_id = str(uuid7())
-        source_info = {k: combine(self.uri, (record_id,), (k,)) for k in output_data}
+        _uuid = uuid.UUID(bytes=uuid7().bytes)


let's use more descriptive variable name to avoid possible confusion with the uuid module. Consider something like new_record_uuid

Renamed to new_record_uuid throughout the function.

eywalker · 2026-06-12T20:50:23Z

Is this supposed to be uuid or bytes that we pass into record_id field for Data?

It was passing uuid.UUID — the intent was correct, the naming was misleading. Renamed the local variable to record_uuid and updated the keyword argument to Data(..., record_uuid=record_uuid).

eywalker · 2026-06-12T20:51:35Z

        python_schema: SchemaLike | None = None,
        meta_info: Mapping[str, DataValue] | None = None,
-        record_id: str | None = None,
+        record_id: uuid.UUID | None = None,


change this to record_uuid to clearly indicate it takes in uuid.UUID

Renamed record_id → record_uuid in Datagram.__init__, _init_from_dict, _init_from_table, and propagated to Tag.__init__ and Data.__init__ as well.

eywalker · 2026-06-12T20:52:51Z

-        record_id = result_table.to_pylist()[0][RECORD_ID_COL]
+        record_id_bytes = result_table.to_pylist()[0][RECORD_ID_COL]
+        # Convert bytes back to uuid.UUID (stored as binary(16) in the DB)
+        record_id = uuid.UUID(bytes=bytes(record_id_bytes)) if record_id_bytes is not None else None


explicitly name this record_uuid

Renamed — local variable is now record_uuid, passed as record_uuid=record_uuid to Data.

eywalker · 2026-06-12T20:56:08Z

+    if value is None:
+        return None
+    # psycopg returns uuid columns as uuid.UUID objects; binary(16) needs raw bytes
+    if arrow_type == _pa.binary(16) and hasattr(value, "bytes") and not isinstance(value, (bytes, bytearray)):


shouldn't this actually do isinstance check against uuid.UUID?

Replaced with isinstance(value, uuid.UUID). Also added import uuid at module level so the check doesn't need a local import.

eywalker · 2026-06-12T21:01:30Z

+        if raw is None:
+            raise ValueError("Missing 'uuid' field in struct dict")
+        content_hash = self._compute_content_hash(bytes(raw))
+        return self._format_hash_string(content_hash.digest, add_prefix=add_prefix)


It's not clear how this works -- isn't this supposed to be more like content_hash.to_string(add_prefix)

Removed _format_hash_string entirely — it was hardcoding sha256 in the format string independently of ContentHash.method, which would silently produce wrong output if the algorithm ever changes.

Replaced with _format_semantic_hash(content_hash, add_prefix) on SemanticStructConverterBase, which produces "{semantic_type_name}:{method}:{hex}" (e.g. "uuid:sha256:abc123") when add_prefix=True, and just the hex digest otherwise. The semantic type name prefix follows the pre-existing convention used by path converters.

Also fixed _compute_content_hash to use method="sha256" only — stuffing "{type}:sha256" into ContentHash.method was an abuse of that field's meaning. Filed PLT-1643 to evaluate whether the semantic-type prefix convention is correct at all, and whether the intermediate column type should be large_binary instead of large_string.

… isinstance, _format_hash_string - rename _uuid → new_record_uuid in data_function.py - rename local record_id → record_uuid in result_cache.py - rename Datagram/Tag/Data __init__ parameter record_id → record_uuid - postgresql_connector: replace duck-type hasattr check with isinstance(value, uuid.UUID) - remove _format_hash_string from SemanticStructConverterBase (hardcoded sha256 independent of ContentHash.method); replace with _format_semantic_hash(content_hash, add_prefix) which delegates to ContentHash.to_string and prepends the semantic type name - fix _compute_content_hash: method="sha256" only (semantic type name belongs in _format_semantic_hash, not stuffed into ContentHash.method) - remove test_format_hash_string (tests deleted method) Refs PLT-1643 for future decision on semantic-type prefix convention and large_string vs large_binary in SemanticHashingVisitor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-12T22:07:35Z

Response to eywalker review (round 3)

All 6 items addressed in commit 62c8f88.

1. _uuid → new_record_uuid (data_function.py) — renamed throughout the function.

2 & 4. record_id → record_uuid (result_cache.py) — local variable renamed to record_uuid; keyword arg updated to Data(..., record_uuid=record_uuid).

3. record_id parameter → record_uuid (datagram.py) — renamed in Datagram.__init__, _init_from_dict, _init_from_table, and propagated to Tag.__init__ and Data.__init__.

5. Duck-type → isinstance (postgresql_connector.py) — replaced hasattr(value, "bytes") and not isinstance(value, (bytes, bytearray)) with isinstance(value, uuid.UUID). Added import uuid at module level.

6. Remove _format_hash_string (semantic_struct_converters.py) — removed entirely; it was hardcoding sha256 independently of ContentHash.method. Also fixed _compute_content_hash to use method="sha256" only (not "{type}:sha256" — that was an abuse of the field). Replaced both call sites with _format_semantic_hash(content_hash, add_prefix), a new helper on SemanticStructConverterBase that produces "{semantic_type_name}:{method}:{hex}" or just the hex, keeping the pre-existing type:algorithm:hex convention intact.

During the investigation, two related open questions surfaced and were filed as PLT-1643: (a) whether the semantic-type prefix in hash strings is actually necessary for correctness, and (b) whether the intermediate pa.large_string() column in SemanticHashingVisitor should be pa.large_binary() instead. Both are deferred to that issue.

eywalker · 2026-06-12T22:30:59Z

+        digest = hashlib.sha256(content).digest()
+        return ContentHash(method="sha256", digest=digest)
+
+    def _format_semantic_hash(self, content_hash: ContentHash, add_prefix: bool) -> str:


Here add_prefix is being ignored -- if no intention of using it, drop it from the argument. Otherwise be sure to include it when generating the final string

Removed entirely. add_prefix wasn't doing anything useful — _format_semantic_hash was hardcoding prefix_method=True regardless. hash_struct_dict now always returns the fully-prefixed form "{type}:sha256:<hex>" with no parameter.

eywalker · 2026-06-12T22:31:23Z

-        content_hash = self._file_hasher.hash_file(path)
-        return self._format_hash_string(content_hash.digest, add_prefix=add_prefix)
+        file_hash = self._file_hasher.hash_file(path)
+        return self._format_semantic_hash(file_hash, add_prefix)


honestly add_prefix here seems unnecessary?

Agreed — dropped. hash_struct_dict now always returns the fully-prefixed "{type}:sha256:<hex>" form, making the parameter unnecessary.

eywalker · 2026-06-12T22:32:44Z


-        content_hash = self._file_hasher.hash_file(path)
-        return self._format_hash_string(content_hash.digest, add_prefix=add_prefix)
+        file_hash = self._file_hasher.hash_file(path)


The behavior of add_prefix = False is not correct here as it disagrees with the docstring. In case of add_prefix. = False, it should simply return file_hash.hex(False), right?

Removed add_prefix entirely rather than trying to fix the inconsistent behavior. The visitor was the only caller and always passed add_prefix=True, so there was no value in keeping the parameter. hash_struct_dict now unconditionally returns "{type}:sha256:<hex>", and the visitor just calls hash_struct_dict(data).

hash_struct_dict always returns the fully-prefixed form "{type}:sha256:<hex>", so the add_prefix parameter was redundant and misleading. Remove it from the protocol, both converter implementations, the visitor call site, and all tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-12T22:53:54Z

Round 4 review — addressed

Removed add_prefix from hash_struct_dict entirely (commit 2fab1b8).

The add_prefix parameter was simultaneously being ignored (hardcoded prefix_method=True inside _format_semantic_hash) and redundant (the visitor was the only caller and always passed True). Rather than patching the inconsistency, removed the parameter across the board:

SemanticStructConverterBase._format_semantic_hash — no longer takes add_prefix; always returns "{type}:{method}:{hex}"
PathStructConverterBase.hash_struct_dict — signature simplified to (self, struct_dict)
UUIDStructConverter.hash_struct_dict — signature simplified to (self, struct_dict)
SemanticStructConverterProtocol.hash_struct_dict — protocol updated to match
SemanticHashingVisitor — call site changed from hash_struct_dict(data, add_prefix=True) to hash_struct_dict(data)
Tests updated throughout (including test_file_hashing_consistency.py which was comparing against raw hex and needed updating)

All 3620 tests pass.

Copilot

Pull request overview

Copilot reviewed 54 out of 54 changed files in this pull request and generated 4 comments.

kurodo3 · 2026-06-12T23:50:51Z

+    Returns:
+        A deterministic ``uuid.UUID`` identifying this (source, row) pair.
+    """
+    name = f"{source_id}::{provenance_token}"
+    return uuid.uuid5(_SOURCE_RECORD_ID_NAMESPACE, name)


Pushing back on this one. The revert in b9a6934 was intentional: source_id is always a hex content hash (characters 0-9, a-f) and provenance_token is always "row_N" or "col=value" format — neither can contain ::. A future issue tracks adding explicit validation to enforce this invariant. No change here.

kurodo3 · 2026-06-12T23:51:06Z

    connector = SQLiteConnector(":memory:")   # PLT-1076
    db = ConnectorArrowDatabase(connector)
-    db.add_record(("results", "my_fn"), record_id="abc", record=table)
+    db.add_record(("results", "my_fn"), record_id=b"\\x00" * 16, record=table)


Fixed in commit 1886a4e. Changed to b"sha256:abc" to make clear record IDs are variable-length bytes (method-prefixed content hashes), not fixed-size UUIDs.

kurodo3 · 2026-06-12T23:51:05Z

            self._pending_batches[record_key] = pa.concat_tables(
                [existing_pending, records]
            )
-        pending_ids = cast(list[str], records[self.RECORD_ID_COLUMN].to_pylist())
+        pending_ids = cast(list[bytes], records[self.RECORD_ID_COLUMN].to_pylist())
        self._pending_record_ids[record_key].update(pending_ids)


Same fix as above — InMemoryArrowDatabase.add_records() now raises TypeError early for non-binary record-id columns. See reply on the connector_arrow_database.py comment for full details.

kurodo3 · 2026-06-12T23:50:59Z

        records = self._deduplicate_within_table(records)
        record_key = self._get_record_key(record_path)
-        input_ids = set(cast(list[str], records[self.RECORD_ID_COLUMN].to_pylist()))
+        input_ids = set(cast(list[bytes], records[self.RECORD_ID_COLUMN].to_pylist()))


Fixed in commit 1886a4e. Both ConnectorArrowDatabase.add_records() and InMemoryArrowDatabase.add_records() now raise TypeError immediately if the record-id column is not pa.binary() or pa.large_binary(). The cast(list[bytes], ...) annotation was hiding the silent failure — hard validation is consistent with the rest of the database layer's strict type contract.

Two internal callers that were building the hash column as pa.large_string() were also fixed: operator_node.py and cached_source.py now encode hash strings to bytes (h.encode()) so the stored column is pa.large_binary(). All test record IDs updated from str to bytes.

…ds() String-typed record-id columns were silently accepted but caused deduplication to always pass (set[str] & set[bytes] == empty set). Both InMemoryArrowDatabase and ConnectorArrowDatabase now raise TypeError immediately if the record-id column is not pa.binary() or pa.large_binary(). The two internal callers that were building the hash column as pa.large_string() are fixed to encode the hash strings to bytes (operator_node.py and cached_source.py). All tests updated to use bytes record IDs throughout. Also changes the docstring example in connector_arrow_database.py from b"\x00" * 16 (UUID-sized) to b"sha256:abc" to make clear record IDs are variable-length bytes, not fixed-size UUIDs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-12T23:51:20Z

Copilot review round — addressed (commit `1886a4e`)

Comment 1 — UUIDv5 name ambiguity (stream_builder.py): Declined. The :: separator is safe here: source_id is always a hex content hash (hex chars only, no ::) and provenance_token is always "row_N" or "col=value" format. This was intentionally reverted in b9a6934 after an earlier round over-fixed it. A future issue tracks formalising this with explicit input validation.

Comments 2 & 3 — String record-id column accepted silently (add_records() in both DB implementations): Fixed with hard type validation. Both InMemoryArrowDatabase and ConnectorArrowDatabase now raise TypeError at the top of add_records() if the record-id column is not pa.binary() or pa.large_binary(). The cast(list[bytes], ...) annotations were lying — they were type-hint-only and provided no runtime protection. This matches the strict type contract the rest of the database layer already enforces.

Two internal callers that were constructing the hash column as pa.large_string() were corrected: operator_node.py and cached_source.py now build the column as pa.large_binary() by encoding the hash hex strings via h.encode(). All database tests updated to use bytes record IDs.

Comment 4 — Docstring example looks UUID-specific: Fixed. Changed b"\x00" * 16 to b"sha256:abc" to make clear record IDs are variable-length bytes, not fixed-size UUIDs.

kurodo3 Bot and others added 14 commits June 11, 2026 23:45

docs(plans): add UUID Arrow type implementation plan (PLT-1162)

1cca130

feat(types): add UUID_ARROW_TYPE and UUID_STRUCT_ARROW_TYPE constants…

045803c

… (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(types): make UUID_ARROW_TYPE and UUID_STRUCT_ARROW_TYPE truly laz…

eaa27c2

…y via module __getattr__ (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(hashing): UUIDHandler returns bytes instead of str, consistent wi…

503b730

…th binary Arrow storage (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(hashing): fix stale UUIDHandler docstring; add missing test docs…

7b3101f

…tring (PLT-1162)

fix(core): correct stale comment and rid_val bytes fallback after UUI…

e391011

…D binary migration (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(types): add PLT-1615 cross-reference in UUID Arrow type spec (PL…

0c922b7

…T-1162)

kurodo3 Bot force-pushed the eywalker/plt-1162-design-spike-uuid-arrow-type-mapping-large_string-vs branch from 414d87e to 0c922b7 Compare June 11, 2026 23:47

eywalker requested changes Jun 11, 2026

View reviewed changes

eywalker requested changes Jun 12, 2026

View reviewed changes

kurodo3 Bot and others added 4 commits June 12, 2026 01:53

refactor(databases): rename _utils.py → utils.py

246e5d5

No leading underscore; the file is not private to the package, just a shared utility module. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eywalker requested a review from Copilot June 12, 2026 05:09

Copilot started reviewing on behalf of eywalker June 12, 2026 05:09 View session

eywalker requested changes Jun 12, 2026

View reviewed changes

Copilot AI reviewed Jun 12, 2026

View reviewed changes

eywalker requested a review from Copilot June 12, 2026 16:52

Copilot started reviewing on behalf of eywalker June 12, 2026 16:52 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

eywalker requested changes Jun 12, 2026

View reviewed changes

eywalker requested review from Copilot and eywalker June 12, 2026 23:04

Copilot started reviewing on behalf of eywalker June 12, 2026 23:04 View session

eywalker approved these changes Jun 12, 2026

View reviewed changes

Copilot AI reviewed Jun 12, 2026

View reviewed changes

eywalker approved these changes Jun 13, 2026

View reviewed changes

eywalker merged commit f572dc9 into main Jun 13, 2026
11 checks passed

eywalker deleted the eywalker/plt-1162-design-spike-uuid-arrow-type-mapping-large_string-vs branch June 13, 2026 00:08



		def __getattr__(name: str) -> object:
		"""Lazy resolution for module-level constants that depend on pyarrow.

		return f"row_{row_index}"


		def _make_record_id_bytes(source_id: str, provenance_token: str) -> bytes:

		@@ -783,7 +786,7 @@ def get_records_by_ids(
		elif isinstance(record_ids, (pa.Array, pa.ChunkedArray)):
		record_ids_list = cast(list[str], record_ids.to_pylist())

Conversation

kurodo3 Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Notes

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

eywalker left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot commented Jun 12, 2026

Review round 1 — addressed in commit 13a1d5a

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot commented Jun 11, 2026 •

edited

Loading

codecov Bot commented Jun 11, 2026 •

edited

Loading

Review round 1 — addressed in commit `13a1d5a`

C1 — `uuid_utils` vs stdlib UUID (`datagram.py`)

C2 — `col_type` propagation through `_predict_system_tag_schema` (`join.py`)

C3 — `record_id` in dedup hash (`cached_source.py`)

C4 — Namespace URL / capitalisation (`stream_builder.py`)

C5 — `_make_record_id` return type (`stream_builder.py`)

C6 / C7 — Drop `str` from `connector_arrow_database.py`

C8 — Drop `large_string` from `delta_lake_databases.py`